Course Notes: most relevant material in Chapters 1-5.
Today’s R Setup
knitr::opts_chunk$set(comment =NA)library(janitor) # for tabyl, clean_names, and other thingslibrary(naniar) # deal with missing valueslibrary(broom) # for tidy, glance and augmentlibrary(car) # for boxCox and viflibrary(Epi) # for twoby2library(GGally) # for ggpairslibrary(knitr) # for kable to neaten tableslibrary(kableExtra) # to adjust font sizes in kableslibrary(MKinfer) # for boot.t.testlibrary(mosaic) # for favstatslibrary(patchwork) # for combining ggplotslibrary(vcd) # for mosaic (plot) and assoc (plot)library(easystats) # adds in lots of tools from easystats ecosystemlibrary(tidyverse) # for all kinds of things, always load lasttheme_set(theme_bw()) # another option I like is theme_lucid()
NHANES 1982 Example (see Course Notes: Chapters 1-5 for a very similar example)
# A tibble: 1,982 × 9
SEQN age educ sbp1 sbp2 sbp3 sroh hospital mentalh
<chr> <dbl> <fct> <dbl> <dbl> <dbl> <fct> <fct> <fct>
1 109266 29 College Grad 99 99 99 Good No No
2 109273 36 Some College/AA 116 110 115 Good No No
3 109291 42 College Grad 107 111 107 Fair Yes No
4 109297 30 Some College/AA 105 105 102 Very Good No No
5 109315 30 Some College/AA 118 123 125 Good No No
6 109317 28 Some College/AA 110 110 110 Very Good No No
7 109332 33 9th - 11th Grade 110 105 108 Excellent No Yes
8 109333 41 Some College/AA 106 107 113 Excellent No No
9 109336 35 College Grad 162 148 163 Good No No
10 109373 30 Some College/AA 111 111 113 Poor No No
# ℹ 1,972 more rows
2017 - March 2020 NHANES Data
1982 NHANES subjects ages 26-42 with complete data on these 9 variables:
Variable
Source
Description
SEQN
P-DEMO
Subject ID: Link (also in BPXO and HUQ)
age
P_DEMO
RIDAGEYR (restricted to ages 26-42 here)
educ
P_DEMO
DMDEDUC2 (five-category factor)
sbp1
BPXO
BPXOSY1 = 1st Systolic BP reading, in mm Hg
sbp2
BPXO
BPXOSY2 = 2nd Systolic BP reading
sbp3
BPXO
BPXOSY3 = 3rd Systolic BP reading
sroh
HUQ
HUQ010 = five-categories E, VG, G, F, P
hospital
HUQ
HUQ071 = Yes or No
mentalh
HUQ
HUQ090 = Yes or No
Variable Descriptions (without SEQN)
Variable
Description (n = 1982)
SEQN
Subject identification code from NHANES
age
Age in years (range 26-42, mean = 34)
educ
Educational Attainment in five categories (see next slide)
sbp1
Systolic Blood Pressure (1st reading)
sbp2
Systolic Blood Pressure (2nd reading)
sbp3
Systolic Blood Pressure (3rd reading)
sroh
Self-reported Overall Health: five categories (see next slide)
hospital
Yes if hospitalized in last 12m, else No (8% Yes)
mentalh
Yes if saw a mental health professional in last 12m, else No (12% Yes)
SROH and Educational Attainment
nh1982 |>tabyl(sroh) |>adorn_pct_formatting()
sroh n percent
Excellent 294 14.8%
Very Good 598 30.2%
Good 728 36.7%
Fair 321 16.2%
Poor 41 2.1%
nh1982 |>tabyl(educ) |>adorn_pct_formatting()
educ n percent
Less than 9th Grade 90 4.5%
9th - 11th Grade 209 10.5%
High School Grad 418 21.1%
Some College/AA 677 34.2%
College Grad 588 29.7%
In Analysis 1, we will compare the means of SBP1 and SBP2 for our 1982 participants.
In Analysis 2, we will compare the mean of SBP3 for our 159 participants who were hospitalized to the mean of SBP3 for our 1823 participants who were not hospitalized.
Which of these analyses uses paired samples, and why?
Let’s build a set of plots to describe the distribution of SBP_diff:
A histogram
A box-and-whisker plot with violins
A normal Q-Q plot
Paired SBP Differences
p1 <-ggplot(nh1982, aes(x = SBP_diff)) +geom_histogram(bins =20, col ="white", fill ="slateblue1") +labs(title ="Histogram", x ="SBP1 - SBP2, in mm Hg", y ="Frequency")p2 <-ggplot(nh1982, aes(sample = SBP_diff)) +geom_qq(col ="slateblue1") +geom_qq_line(col ="red") +labs(title ="Normal Q-Q plot", x ="Standard Normal Distribution",y ="Observed SBP1 - SBP2")p3 <-ggplot(nh1982, aes(x = SBP_diff, y ="")) +geom_violin(fill ="wheat") +geom_boxplot(width =0.4, fill ="slateblue1", outlier.size =2) +labs(title ="Boxplot with Violin", x ="SBP1 - SBP2 differences", y ="")(p1 + p2) / p3 +plot_layout(heights =c(3,1)) +plot_annotation(title ="SBP1 - SBP2 across 1982 NHANES participants")
Paired SBP Differences
Comparing Paired Samples
Want a 95% confidence interval for the true mean of the paired SBP1 - SBP2 differences:
t-based approach (linear model) assumes Normality
Wilcoxon signed rank approach doesn’t assume Normality but makes inferences about the pseudo-median
bootstrap doesn’t assume Normality, and describes mean
set.seed(20250116)boot.t.test(nh1982$SBP_diff, conf.level =0.95, boot =TRUE, R =999)
Results on the next two slides…
Bootstrap 95% CI
Estimate mean of (SBP1 - SBP2) for population based on sampled 1982 NHANES participants.
Sample mean SBP1 - SBP2 difference = 0.248
95% CI from bootstrap: (0.020, 0.496)
95% CI from t-based approach: (0.016, 0.481)
boot.t.test() from MKinfer package results on next slide
boot.t.test() results
Bootstrap One Sample t-test
data: nh1982$SBP_diff
number of bootstrap samples: 999
bootstrap p-value = 0.05205
bootstrap mean of x (SE) = 0.2507699 (0.1185597)
95 percent bootstrap percentile confidence interval:
0.01967709 0.49603935
Results without bootstrap:
t = 2.0931, df = 1981, p-value = 0.03646
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.01565269 0.48081552
sample estimates:
mean of x
0.2482341
Interpreting the Bootstrap CI
The confidence interval reflects imprecision in the population estimate, based only on assuming that the participants are selected at random from the population of interest.
When we generalize beyond study participants to the population they were selected at random from, then our data are compatible (at the 95% confidence level) with population means of SBP1 - SBP2 between 0.020 and 0.496, depending on the assumptions of our bootstrap procedure being correct.
Comparing sbp3 by hospital: Independent Samples
favstats(sbp3 ~ hospital, data = nh1982) |>kable(digits =2) |>kable_styling(font_size =24)
hospital
min
Q1
median
Q3
max
mean
sd
n
missing
Yes
88
104
113
124.5
195
116.71
18.50
159
0
No
60
107
115
124.0
204
116.11
14.51
1823
0
Our sample yields a point estimate for the “Hospitalized” - “Not Hospitalized” difference in means of 0.60 mm Hg.
Let’s draw a picture that lets us compare SBP3 values across the two groups.
Comparison Boxplot, with Violins
ggplot(nh1982, aes(x =factor(hospital), y = sbp3)) +geom_violin(aes(fill =factor(hospital))) +geom_boxplot(width =0.3, notch =TRUE) +guides(fill ="none") +labs(title ="SBP (3rd reading) by Hospitalization")
Comparison Boxplot, with Violins
Independent Samples: Comparing Means
Want a 90% confidence interval for the difference in means of SBP3 for those hospitalized - those not.
Pooled t-based approach (equivalent to linear model) assumes Normality and equal population variances
Welch t-based approach assumes Normality only
bootstrap assumes neither
Suppose we’re willing to assume both Normality and equal population variances…
Our sample yields a point estimate for the “Hospitalized” - “Not Hospitalized” difference in means of 0.60 mm Hg, with a 95% confidence interval of (-1.8, 3.0) mm Hg.
When we generalize beyond study participants to the population they were selected at random from, then our data are compatible (at the 95% confidence level) with a population mean difference (hospitalized - not hospitalized) in SBP3 values between -1.8 mm Hg and 3.0 mm Hg, depending on the assumptions of our linear model being correct.
sroh
educ Excellent Very Good Good Fair Poor Total
Less than 9th Grade 10 7 36 33 4 90
9th - 11th Grade 21 40 81 59 8 209
High School Grad 50 94 168 98 8 418
Some College/AA 72 220 264 104 17 677
College Grad 141 237 179 27 4 588
Total 294 598 728 321 41 1982
sroh
educ Excellent Very Good Good Fair Poor
Less than 9th Grade 11.1% 7.8% 40.0% 36.7% 4.4%
9th - 11th Grade 10.0% 19.1% 38.8% 28.2% 3.8%
High School Grad 12.0% 22.5% 40.2% 23.4% 1.9%
Some College/AA 10.6% 32.5% 39.0% 15.4% 2.5%
College Grad 24.0% 40.3% 30.4% 4.6% 0.7%
Total 14.8% 30.2% 36.7% 16.2% 2.1%
Mosaic Plot for our 5x5 Table
mosaic(~ educ + sroh, data = nh1982, highlighting ="sroh")
Pearson \(\chi^2\) test for our 5x5 Table
chisq.test(xtabs(~ educ + sroh, data = nh1982))
Pearson's Chi-squared test
data: xtabs(~educ + sroh, data = nh1982)
X-squared = 225.99, df = 16, p-value < 2.2e-16
# A tibble: 2 × 6
model fit_meansbp .fitted age sroh hospital
<chr> <dbl> <dbl> <dbl> <chr> <chr>
1 m1 112. 8.92 29 Good No
2 m2 112. 8.90 29 Good No
Setting Up Lab 1, due 2025-01-22 at Noon
Lab 1 Question 1
I provide some County Health Rankings data for 30 variables and 2054 counties included in the CHR 2024 report. You will filter the data down to the 88 counties in Ohio, and check for missing values.
Then you will create a visualization involving information from three different variables (from a list of 15) using R and Quarto.